
🌱 fix(e2e): wait for leader election #1676

Conversation

@camilamacedo86 (Contributor) commented Jan 31, 2025

TestClusterExtensionAfterOLMUpgrade was failing due to increased leader election timeouts, causing reconciliation checks to run before leadership was acquired.

This fix ensures the test explicitly waits for leader election logs ("successfully acquired lease") before verifying reconciliation.

Example: https://github.com/operator-framework/operator-controller/actions/runs/13047935813/job/36401741998

Logs from operator-controller:

I0130 08:12:51.290826       1 main.go:427] "starting manager" logger="setup"
I0130 08:12:51.291050       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"
I0130 08:12:51.291109       1 server.go:83] "starting server" name="health probe" addr="[::]:8081"
I0130 08:12:51.291121       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8443" secure=true
I0130 08:12:51.291209       1 leaderelection.go:257] attempting to acquire leader lease olmv1-system/9c4404e7.operatorframework.io...


@camilamacedo86 requested a review from a team as a code owner, January 31, 2025 03:42

netlify bot commented Jan 31, 2025

Deploy Preview for olmv1 ready!

Latest commit: 25ffe30
Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/679cdc61ed84a200085c3178
Deploy Preview: https://deploy-preview-1676--olmv1.netlify.app

@camilamacedo86 force-pushed the fix-test-leader-election branch 4 times, most recently from a96e69f to 6b04b01, January 31, 2025 04:09

codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.48%. Comparing base (e77c53c) to head (25ffe30).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
- Coverage   67.50%   67.48%   -0.03%     
==========================================
  Files          57       57              
  Lines        4632     4632              
==========================================
- Hits         3127     3126       -1     
- Misses       1278     1279       +1     
  Partials      227      227              
Flag Coverage Δ
e2e 53.40% <ø> (ø)
unit 54.27% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


@camilamacedo86 force-pushed the fix-test-leader-election branch 3 times, most recently from 0ee61e2 to 4238c2b, January 31, 2025 05:07
@@ -40,6 +40,20 @@ func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
t.Log("Wait for operator-controller deployment to be ready")
managerPod := waitForDeployment(t, ctx, "operator-controller-controller-manager")

t.Log("Start measuring leader election time")
@perdasilva (Contributor) commented Jan 31, 2025

I think we need to be careful about how we measure the timing here. What we are measuring right now is the amount of time between:

  • the test detecting that the operator-controller deployment rollout has finished, and
  • watchPodLogsForSubstring(leaderElectionCtx, managerPod, "manager", leaderSubstrings...) returning.

This may correlate with the time taken for leader election, but it won't necessarily. E.g. if I upgrade the deployments, go out for lunch for an hour, then come back and run the post-upgrade test, the measurement mostly reflects my lunch break.

Maybe it would be better to extract the timestamps from the first log line and the leader-election log line instead?
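
For illustration, a minimal standalone sketch of that idea (not code from this PR): parse the klog header timestamp from the "starting manager" line and from the "successfully acquired lease" line, then take the difference. The klogTimestamp helper and the second sample line below are made up for the example.

package main

import (
	"fmt"
	"time"
)

// klogTimestamp extracts the timestamp from the header of a klog-formatted
// line such as `I0130 08:12:51.290826 ...`. The header carries no year, so
// only differences between lines from the same run are meaningful.
func klogTimestamp(line string) (time.Time, error) {
	if len(line) < 21 {
		return time.Time{}, fmt.Errorf("line too short: %q", line)
	}
	// Skip the leading severity letter, then parse "mmdd hh:mm:ss.uuuuuu".
	return time.Parse("0102 15:04:05.000000", line[1:21])
}

func main() {
	// The first line is taken from the logs above; the second is a made-up
	// example of the leader-election message.
	started := `I0130 08:12:51.290826       1 main.go:427] "starting manager" logger="setup"`
	acquired := `I0130 08:13:05.123456       1 leaderelection.go:260] successfully acquired lease olmv1-system/9c4404e7.operatorframework.io`

	t0, err := klogTimestamp(started)
	if err != nil {
		panic(err)
	}
	t1, err := klogTimestamp(acquired)
	if err != nil {
		panic(err)
	}
	fmt.Printf("leader election took %s\n", t1.Sub(t0))
}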

@camilamacedo86 (Contributor, Author) replied:

Your comment makes 100% sense.

To keep things simple and focused on the goal of this PR, I've removed the measurement aspect. Whether we want to include it as info or debug output, or decide on a specific measurement approach, is a separate discussion. For now, let's stay within the scope of this change: fixing the test flake and unblocking progress.

@camilamacedo86 force-pushed the fix-test-leader-election branch from 4238c2b to 25ffe30, January 31, 2025 14:21
@camilamacedo86 changed the title from "🌱 fix(e2e): wait for leader election & measure timing for better monitoring" to "🌱 fix(e2e): wait for leader election", Jan 31, 2025
@perdasilva (Contributor) left a comment:

lgtm ^^

t.Log("Wait for acquired leader election")
// The average case is under 1 minute, but in the worst case (the previous leader
// crashed) we could need LeaseDuration (137s) + RetryPeriod (26s) ≈ 163s.
leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
A member commented:

I am assuming 3 minutes is the worst-case scenario. I am not familiar with context.WithTimeout; does it return if we acquire the lease before 163s?

A contributor replied:

context.WithTimeout just gives you a context that times out (gets cancelled) after the timeout period.
This means the call to watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...) will return with an error after 3 minutes if it hasn't returned already.
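
A tiny standalone illustration of those semantics (not code from this PR): the timeout only bounds the wait, and a caller that finishes earlier returns before the context is ever cancelled.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitFor stands in for the log watcher: it returns as soon as work completes,
// or with ctx.Err() if the context times out first.
func waitFor(ctx context.Context, work <-chan struct{}) error {
	select {
	case <-work:
		return nil // finished before the deadline
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the timeout elapses
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
	defer cancel()

	work := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // e.g. the lease is acquired quickly
		close(work)
	}()

	err := waitFor(ctx, work)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded)) // prints: timed out: false
}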

The member replied:

ok, looks like it is a straightforward timeout method.

defer leaderCancel()

leaderSubstrings := []string{"successfully acquired lease"}
leaderElected, err := watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...)
A contributor commented:

Scraping the logs seems brittle.

Would it be better to use a Watch on the leader-election Lease? We could use the Leases client from CoordinationV1Client in "k8s.io/client-go/kubernetes/typed/coordination/v1".
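
A rough sketch of that idea (not part of this PR): watch the leader-election Lease directly and wait for a holder to appear. Assumptions: the clientset wiring, the hypothetical waitForLeaderLease helper, and the lease namespace/name taken from the logs above (olmv1-system/9c4404e7.operatorframework.io); the typed Coordination client is reached via the clientset's CoordinationV1() accessor.

package e2e

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForLeaderLease blocks until the leader-election Lease has a holder,
// or until ctx is cancelled. It returns the holder identity.
func waitForLeaderLease(ctx context.Context, cs kubernetes.Interface) (string, error) {
	w, err := cs.CoordinationV1().Leases("olmv1-system").Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=9c4404e7.operatorframework.io",
	})
	if err != nil {
		return "", err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return "", fmt.Errorf("watch closed before a leader was elected")
			}
			lease, isLease := ev.Object.(*coordinationv1.Lease)
			if isLease && lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != "" {
				return *lease.Spec.HolderIdentity, nil
			}
		}
	}
}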

@bentito (Contributor) commented Jan 31, 2025:

I realize it's also longer and more code, but the upside is that it reacts right away, like watching the pod log does, without depending on log strings that could change at some point and break our tests outside of our control.

A contributor replied:

that's a good idea! If this work is blocking CI, I'd say merge it as it is, then follow up with the watch ^^

@camilamacedo86 (Contributor, Author) replied:

Yes, I agree that we could do something fancier, but we already check the logs in many places, including right below. We can look at improving this afterwards; there is no reason for us to keep facing the pain of the flake.
